Enabling Interactive Analytics of Secure Data using Cloud Kotta
Research, especially in the social sciences and humanities, is increasingly
reliant on the application of data science methods to analyze large amounts of
(often private) data. Secure data enclaves provide a solution for managing and
analyzing private data. However, such enclaves do not readily support discovery
science---a form of exploratory or interactive analysis by which researchers
execute a range of (sometimes large) analyses in an iterative and collaborative
manner. The batch computing model offered by many data enclaves is well suited
to executing large compute tasks; however, it is far from ideal for day-to-day
discovery science. As researchers must submit jobs to queues and wait for
results, the high latencies inherent in queue-based, batch computing systems
hinder interactive analysis. In this paper we describe how we have augmented
the Cloud Kotta secure data enclave to support collaborative and interactive
analysis of sensitive data. Our model uses Jupyter notebooks as a flexible
analysis environment and Python language constructs to support the execution of
arbitrary functions on private data within this secure framework.
Comment: To appear in Proceedings of Workshop on Scientific Cloud Computing, Washington, DC, USA, June 2017 (ScienceCloud 2017), 7 pages.
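The abstract describes using Python language constructs to execute arbitrary functions on private data inside the enclave. A minimal sketch of that pattern is below; the names (`secure_task`, `run_in_enclave`) are illustrative inventions, not Cloud Kotta's real API, and a real system would also serialize the function body itself (e.g. with cloudpickle) rather than passing the callable directly as done here to keep the sketch stdlib-only.

```python
import pickle
from typing import Any, Callable

# Hypothetical sketch: a decorator ships a function's inputs to a secure
# enclave and runs the analysis near the protected data. All names here
# are assumptions for illustration, not the actual Cloud Kotta interface.

def secure_task(func: Callable) -> Callable:
    """Wrap a function so its invocation is executed 'inside' the enclave."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        payload = pickle.dumps((args, kwargs))  # serialize the inputs
        return run_in_enclave(func, payload)    # execute near the data
    return wrapper

def run_in_enclave(func: Callable, payload: bytes) -> Any:
    # Stand-in for the enclave side: deserialize inputs and execute.
    args, kwargs = pickle.loads(payload)
    return func(*args, **kwargs)

@secure_task
def word_count(records: list) -> int:
    # Example analysis over (notionally private) text records.
    return sum(len(r.split()) for r in records)
```

In this shape, the researcher writes ordinary Python in a Jupyter notebook; the decorator is the only boundary between the interactive session and the secure execution context.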
Cloud Kotta: Enabling Secure and Scalable Data Analytics in the Cloud
Distributed communities of researchers rely increasingly on valuable, proprietary, or sensitive datasets. Given the growth of such data, especially in fields new to data-driven research like the social sciences and humanities, coupled with what are often strict and complex data-use agreements, many research communities now require methods that allow secure, scalable, and cost-effective storage and analysis. Here we present CLOUD KOTTA: a cloud-based data management and analytics framework. CLOUD KOTTA delivers an end-to-end solution for coordinating secure access to large datasets, and an execution model that provides both automated infrastructure scaling and support for executing analytics near the data. CLOUD KOTTA implements a fine-grained security model ensuring that only authorized users may access, analyze, and download protected data. It also implements automated methods for acquiring and configuring low-cost storage and compute resources as they are needed. We present the architecture and implementation of CLOUD KOTTA and demonstrate the advantages it provides in terms of increased performance and flexibility. We show that CLOUD KOTTA's elastic provisioning model can reduce costs by up to 16x when compared with statically provisioned models.
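The fine-grained security model described above, in which only authorized users may access, analyze, or download a protected dataset, can be sketched as a per-dataset grant check. This is a hedged illustration of the concept only; the names (`Dataset`, `can_access`) are hypothetical and not CLOUD KOTTA's actual implementation.

```python
from dataclasses import dataclass, field

# Illustrative sketch of per-dataset, per-user authorization: every
# request to touch protected data is checked against an explicit grant
# list. Names are assumptions, not the real CLOUD KOTTA API.

@dataclass
class Dataset:
    name: str
    authorized_users: set = field(default_factory=set)

ALLOWED_ACTIONS = {"access", "analyze", "download"}

def can_access(user: str, dataset: Dataset, action: str) -> bool:
    # Deny by default: the action must be recognized AND the user must
    # hold an explicit grant on this specific dataset.
    return action in ALLOWED_ACTIONS and user in dataset.authorized_users
```

The deny-by-default structure is the key property: a user with no grant on a dataset cannot reach it through any action.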
The Changing Role of RSEs over the Lifetime of Parsl
This position paper describes the Parsl open source research software project
and its various phases over seven years. It defines four types of research
software engineers (RSEs) who have been important to the project in those
phases; we believe this is also applicable to other research software projects.
Comment: 3 pages.
Developing Distributed High-performance Computing Capabilities of an Open Science Platform for Robust Epidemic Analysis
COVID-19 had an unprecedented impact on scientific collaboration. The
pandemic and its broad response from the scientific community has forged new
relationships among domain experts, mathematical modelers, and scientific
computing specialists. Computationally, however, it also revealed critical gaps
in the ability of researchers to exploit advanced computing systems. These
challenging areas include gaining access to scalable computing systems, porting
models and workflows to new systems, sharing data of varying sizes, and
producing results that can be reproduced and validated by others. Informed by
our team's work in supporting public health decision makers during the COVID-19
pandemic and by the identified capability gaps in applying high-performance
computing (HPC) to the modeling of complex social systems, we present the
goals, requirements, and initial implementation of OSPREY, an open science
platform for robust epidemic analysis. The prototype implementation
demonstrates an integrated, algorithm-driven HPC workflow architecture,
coordinating tasks across federated HPC resources, with robust, secure and
automated access to each of the resources. We demonstrate scalable and
fault-tolerant task execution, an asynchronous API to support fast
time-to-solution algorithms, an inclusive, multi-language approach, and
efficient wide-area data management. The example OSPREY code is made available
on a public repository.
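The asynchronous API described above, which lets an algorithm consume results as they arrive rather than waiting at a barrier, can be sketched with Python's standard `concurrent.futures`. Here a thread pool merely stands in for OSPREY's federated HPC resources, and `run_model` is a hypothetical placeholder for an epidemic-model run; none of these names come from the OSPREY codebase.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Sketch of an algorithm-driven, asynchronous task pattern: submit a
# batch of model runs, then consume each result as soon as it finishes
# so the driving algorithm can react without waiting for the slowest task.
# ThreadPoolExecutor stands in for remote, federated HPC resources.

def run_model(param: int) -> int:
    # Placeholder for a (potentially long-running) simulation task.
    return param * param

def asynchronous_sweep(params: list) -> dict:
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(run_model, p): p for p in params}
        for fut in as_completed(futures):         # completion order, not submit order
            results[futures[fut]] = fut.result()  # algorithm could adapt here
    return results
```

The `as_completed` loop is where a fast time-to-solution algorithm would inspect early results and decide which tasks to launch next.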
A Composition-Transferable Machine Learning Potential for LiCl-KCl Molten Salts Validated by HEXRD
Unraveling the liquid structure of multi-component molten salts is challenging due to the difficulty of conducting and interpreting high-temperature diffraction experiments. Motivated by this challenge, we developed composition-transferable Gaussian Approximation Potentials (GAP) for molten LiCl-KCl. A DFT-SCAN-accurate GAP is actively learned from only ~1100 training configurations drawn from 10 unique mixture compositions enriched with metadynamics. The GAP-computed structures show strong agreement with HEXRD experiments, including for a eutectic composition not explicitly included in model training, thereby opening the possibility for composition discovery.
DLHub: Model and Data Serving for Science
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities with a focus on science applications. DLHub addresses two significant shortcomings in current systems. First, its self-service model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models and all constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, plus multiple-function pipelines. We show that relative to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance without memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.
Comment: 10 pages, 8 figures, conference paper.
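The serving model described above, in which any Python-compatible function can be published by name and chained into multiple-function pipelines, can be sketched as a registry of callables. This is a conceptual illustration only; the names (`publish`, `invoke`, `invoke_pipeline`) are hypothetical and do not reflect DLHub's actual SDK.

```python
from typing import Any, Callable

# Sketch of name-based function serving: arbitrary Python callables are
# published under names and can be composed into pipelines where each
# stage's output feeds the next. All names are illustrative assumptions.

registry: dict = {}

def publish(name: str, func: Callable) -> None:
    """Register a callable under a servable name."""
    registry[name] = func

def invoke(name: str, data: Any) -> Any:
    """Run a single published function on the given input."""
    return registry[name](data)

def invoke_pipeline(names: list, data: Any) -> Any:
    # A multiple-function pipeline: stages execute in order.
    for name in names:
        data = invoke(name, data)
    return data

# Publishing two toy "models": a preprocessing step and an aggregator.
publish("normalize", lambda xs: [x / max(xs) for x in xs])
publish("mean", lambda xs: sum(xs) / len(xs))
```

Serving by name is what decouples model consumers from model packaging: a caller needs only the published identifier, not the model's code or environment.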